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1. INTRODOCTICN 


Least-squares (LS) estimation is a basic operation in many signal process 
ing problems. Given y=Ax+v, where A is a mxn coefficient matrix, y is a 
mxl observation vector , and v is a mxl zero mean white noise vector, a 
simple least-squares solution is finding x which minimizes ||Ax-y||. It is 
well known that for an ill-conditioned matrix A, solving least-squares 
problems by orthogonal triangular (QR) decomposition and back substitution 
has robust numerical properties under finite word length effect since 
2-norro is preserved. Many fast algorithms have been proposed and applied 
to systolic arrays. Gentleman-Kung (1981) first presented the triangular 
systolic array for a basic Givens reduction. Mcwhirter (1983) used this 
array structure to find the least-squares estimation errors. Then by 
geometric approach, several different systolic array realizations of the 
recursive least-squares estimation algorithms of Lee et al (1981) were 
derived by Kalson-Yao (1985) . We consider basic QR decomposition algo 
ritrtns and find that under one-row time updating situation, the House 
holder transformation degenerates to a simple Givens reduction. Next, we 
derive an improved least-squares estimation algorithm by considering a 
modified version of fast Givens reduction. From this approach, the basic 
relationship between Givens reduction and Modified-Gram-Scftnidt transfor 
mation can easily be understood. We also can see this improved algorithm 
has simpler computational and inter-cell connection complexities while 
compared with other known least-squares algorithms and is more realistic 
for systolic array implementation. 

Minimum variance estimation (popularized by Kalman (I960)) is the general 
ized form of a least-squares problem, where the state vector x is charac 
terized by the state equation x k+1 c Fx k +w, the system noise w and the 
observation noise v are colored. The original algorithm presented by Kal 
man can have poor numerical property. Some algorithms for improving 
numerical properties, such as square-root covariance and square-root 
information methods have been studied. Now, we find that after the whi 
tening processing, this minimum variance estimation can be formulated as 
the modified square-root information filter and be solved by the simple 
least-squares processing. This new approach contains advantages in both 
numerical accuracy as well as computational efficiency as compared to the 
original Kalman algorithm. Since all these processings can be implemented 
by systolic arrays, high throughput rate computation for Kalman filtering 
problems become feasible. 



2. SIMPLE LEAST-SQUARES ESTIMATION 


Given the equation b=Ax+v, it is well known that we can solve the least- 
squares solution $ by normal equation. However, this approach not only 
requires tne computation of a matrix inverse but also doubles the condi 
tion number when we form A'A. Although using singular value decomposition 
for least-squares solution can improve numerical properties, the computa 
tional complexity involved in SVD is not low. Besides, fast algorithm for 
SVD is still underdevelopment. Lattice structure for least-squares solu 
tion was proposed and studied by Lee et al (1981) . This approach was 
shown to have stable numerical property and regular hardware structure. 
However, this method required shifting property of the coefficient matrix 
and can not apply to all general cases. QR decomposition is another solu 
tion to obtain x, since 2-norm is preserved by multiplying an orthogonal 
matrix Q, then by letting QA=R be a upper triangular matrix, the x can be 
obtained by using back substitution for the equation Rx=&. This approach 
has robust numerical properties since the 2-norm is fixed, the rounding 
error caused by finite word length effect will not grow. Basically, there 
are three ways for performing QR decomposition, namely, Householder trans 
formation, Givens reduction, and Modif ied -Gram- Schmidt orthogonalization. 
It can be shown that under one row time updating situation (as in the sys 
tolic array implementation) , the Householder transformation matrix will 
degenerate to a simple Givens reduction case. 

Systolic array implementation for QR decompositions in least-squares esti 
nation was first explored by Gentleman-Kung and followed by MCWhirter and 
Kalson-Yao, By using a triangular systolic array, it was shown that the 
estimation error for the last observation can be solved at every clock 
period. The systolic array structure for least-squares estimation is 
shown in Figure 2.1. To achieve fully pipelined operation, the input rows 
are skewed and propagated like wavefronts in the diagonal direction. There 
are only two basic processing units, boundary cell and internal cell, are 
required by this systolic array. Communication between different process 
ing units are all local. The properties of regularity and local cormuni 
cation are consistent with the philosophy of VLSI implementation. Sunmary 
of input/output formats and operation functions for two kinds of process 
ing units are shewn in Table 1 and Table 2 respectively. 


Table 1. input/CXitput format of systolic array algorithms 
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The above symbols are for notations only, their physical meaning may 
change for different algorithms. 

Table 2. Operational functions of processing units 
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Fran systolic array point of view, the difference between algorithms pro 
posed by Mtf'fhirter and Kalson-Yao lies in the basic computations in two 
kinds of processing units. Since these algorithms were derived from two 
different approaches, specifically Givens reduction and Modified-Gram- 
Schmidt orthogonal izat ion, the basic relationship for. these two QR decompo 
sition methods under one row time updating can be compared as follows. 
First, we derived the modified expression for the fast-Givens reduction as 
given by 
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the updating equation for this roodified-fast-Givens algorithm becomes, 

Boundary cell: d'=d + x 2 /(l/o) (l/o')=(l/o) + x 2 /d 

Internal cell: b'=b - (x/d)*dk d'k'=dk + b*x/(l/o) [1] 


By comparing the computational complexity between the fast Givens algo 
rithm by Gentleman (1973) and that in II] , we can see II] has one mul ti 






plication less than the original algorithm. And since we do not have 
interest on the real rotated elements like (1/ ►'SDdk., we do not have the 
risk of dividing by a very small d. The numerical properties of the modi 
fied algorithm is then expected to comparable to the numerical properties 
of the original one. By equation [1], the basic duality associations 
between Givens reduction and Modi fied -Gram- Schmidt orthogonalization is 
summarized in Table 3, which allows us to derive different algorithms for 
least-squares estimation from different approaches with efficiency. 


Table 3. IXiality association for M-G-S and Fast-Givens reduction. 
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With systolic array implementation, comparison of computational complexity 
for algorithms discussed above can be made by comparing the number of 
operations required in each processing unit. When the dimension of the 
coefficient matrix becomes large, wavefront array processing of Rung 
(1983) becomes more appropriate for the control scheme. In this case, the 
speed of this "wavefront" will be decided by the slowest processing unit 
along each wavefront. In modified fast Givens algorithm, equations for 
boundary cell are non-recursive and can be done in parallel if we can 
double the computational capability of each boundary cell. In this case, 
the wavefront speed and then the throughput rate can be doubled. The sys 
tolic array we discussed above will generate estimation error at each 
clock period. While the estimated vector £ is not shown explicitly, St can 
be solved by back substitution which can be done by just appending a nxn 
identity matrix after the coefficient matrix A. 


3. MINIMUM VARIANCE ESTIMATIONS AND KALMAN FILTERING 


Often the signal vector x is a random process and can be modeled as a 
first order recursive equation. In this case, a first order recursive 
estimation (or Kalman filtering) problem can be stated as follows, 



[ 2 ] 


where F and C are time-varying coefficient matrices with dimension nxn and 
raxn respectively, w. is a nxl and v. is a raxl zero mean noise vectors 
with known covariance matrices W. and V. respectively. It is assumed that 
noises w and v are uncorrelated and E tw^w . ] =E [v . v . ] =0 for all i^j. Under 
the minimum variance criterion, we want x td findxj^ for all k, such that 
E|I(x. - x, )^|| is minimized. Kalman showed that x k can be obtained by the 
recursive algorithm given as 
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The information matrix is defined as the inverse of the error covariance 
matrix P. Besides [3] , it is shown that instead of propagating the error 
covariance matrix, the Kalman filtering problem can be solved by propagat 
ing the information matrix during the iterations. Both covariance and 
information filters are recursive since the current updating depends only 
on results from previous stage. The choice between covariance filter and 
information filter depends on the values of n and m. When n>m, which is 
usually the case, the original Kalman filtering is chosen to avoid the 
inverse of the nxn matrix. However, Kalman algorithm is known for its 
poor numerical properties, especially for non-observable coefficient mat 
rices. The original Kalman filter needs an approximate 0(n' 9 ) multiplica 
tion time for each iteration. If m>l, computation of a matrix inversion 
is inevitable. Since all equations are sequential in manner, if real time 
computation is required for a Kalman filtering problem, some modifications 
must be done to insure the capability for parallel computation. Among 
many possible modified algorithms, square-root filtering have been proved 
to have computational efficiency and robust numerical properties under 
finite word length effect (Kaminski 1971) . The main advantage of the 
square root filter is that we can handle the covariance matrix by its 
square root form which has condition number smaller than the original one. 
Therefore, for ill-conditioned problems, when we used the square root fil 
ter with a single precision machine, we can expect the same numerical 
result as if we have used the original algorithm on a double precision 
machine. Updating processings for both square root covariance filter and 
square root information filter can be expressed in matrix forms and 
handled by the QR decomposition method which is capable of systolic array 
implementation. However, only square-root information filter allows us to 
update the estimated state vector as well as the information matrix by 
using the same transformation matrix Q. When both updated covariance mat 
rix and state vector are important to us, we find square- root information 
filter is a better solution for the systolic array implementation . The 
square-root information filter requires computation of the inverse of the 
coefficient matrix F, which will cause bad numerical properties for F 
being near singular. One version of the square root information matrix 
method for Kalman filtering was considered by Paige and Saunders (1977) . 

It is shewn that by using whitening processing through Cholesky deconposi 
tion, the Kalman filtering can be represented as a siitple least-squares 
problem. This approach does not require the computation of the inverse of 
the matrix F and is more suitable for systolic array implementation. 


The whitening processing can be briefly described as below. Assume 
W=L, L ' and V=L L ' are the Cholesky deconposi tion of covariance 
matrices W and ^.^Jtfith W~ =L L ' and V*=L L ' , it can be proved that 
L =L and L =L . w, =L 'w* Snd v.=L 'v. v are whitened noises with 
iSenSity covahaiice mat!fic&. k k v k 


Denote F=L 'F, C=L 'C, and =L 'y. . We can express the whitened 
System equations in the matrix-vector form as 
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We can see that R-, i=l,2...k, in [5] are all upper triangular matrices, 
and x. , the optimum estimated vector at time k, depends only on the last 
line, i.e. , R. x. =y, . Furthermore, at T*=k+1, the updating equation 
depends on the last row of [5] only. That is, the QR decomposition at 
T^k+1 only depends on a (2n+m)x(2rH-l) matrix as in [6]. When the QR 
decomposition of [6] is completed, we have R . (upper triangular) and 
y k+ ^ ready for iteration of next stage. 


[6] 
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where * is the term used to compute the residue. 


The upper triangular matrix R. can be shown to be the square-root of the 
inverse of the error covariance matrix P. =E[ (x. -x. ) (x. -x. ) '] . That is, 
this algorithm, which propagates the square root information matrix for 
next iteration, is actually a modified square-root information filtering. 


4. SYSTOLIC ARRAY IMFUHEtflATIONS FOR KALMAN FILTERING 


From last section, we can see that the basic operations for square root 
Kalman filtering can be described in two parts. The first one, whitening 
processing includes operations such as Cholesky decomposition, inverse of 
triangular matrix, and matrix multiplication. Secondly, the QR decomposi 
tion is applied. Obviously, these two parts can be operated in parallel. 
That is, we can start the whitening processing for the (k+l)st iteration 
as well as the QR decomposition for the k-th iteration at the same time in 
a pipelined manner. 

The original square-root information filter involves the computation of 
the inverse of the coefficient matrix F which not only increases the com 
putational complexity but also causes bad numerical properties when coef 
ficient matrix F is singular or near singular. This shortcoming can be 



recovered by choosing the modified square root information filtering in 
143 • As shown from I4]-16J, formulation of the modified square-root 
information filter involves only multiplication between coefficient mat 
rices and the inverse of the square root noise covariance matrices. For 
noise with positive definite covariance, square root covariance matrix 
always exists. 

4.1 Whitening Processing 

The whitening processing is done by multiplying the coefficient matrix 
with a whitening operator L* where (LL') -1 is the given covariance matrix 
of the additive noise. Since a covariance matrix is a positive definite 
symmetric matrix, the square root matrix can be obtained by the Cholesky 
decomposition. A triangular systolic array for Cholesky decomposition is 
designed for this purpose with outputs skewed to match the input format of 
the QR systolic array. 

The inversion of a upper triangular matrix is sinple after we built the 
basic systolic array for QR decomposition. The idea for the inversion of 
a upper triangular matrix is the same as solving the back substitution. 

With UU -1 =I, let ... u ], with ji- being a nxl column 

vector. A matrix inversion^can be divided into n sets of linear equa 
tions, each having the form of i=l,2,...n, where £• is a nxl 

column vector with i“ element equals to 1, and all othersToeing 0, and 
can be solved by a systolic array. 

4.2 QR Decomposition for Kalman Filtering 

Equation [6] suggests that X. can be solved as a least-squares solution by 
a 2nx2n QRjsystolic array. However, serious delay will be caused by the 
fact that R. and R. + , are not in-place computations. That is, we have 
trouole to move the newly formed R from the upper-right corner to the low 
er-left corner in our triangular array for the next iteration. That is, 
the computation at stage k+1 can not start until the last element of R k is 
completed. In this "waiting" period, most of processing units are idle 
and the pipeline is empty. It will cause delay for at least 2n clock 
periods. 

This disadvantage can be overcome by in-place computations for R. and 
R. + . . This can be done by partitioning the original matrix into K two 
strips, and perform the partitioned QR decomposition by the systolic array 
structure proposed in Figure 2. In this approach, a nxn QR systolic array 
as well as a rotation array which consists of nx (n+1) internal cells are 
used. Once elements of R. are formed, it is ready to be used for 
computations at stage k+1. Here we need only to pass transformed elements 
generated by the first strip to the rectangular rotation array for the 
pre-processing of the second strip. This input format is shown in Figure 
3. Since all these can be done in fully pipelined manner and in-place 
computations are obtained, complicated inter-cell connection and control 
scheme can both be avoided. To obtain the estimated value St., we can 
just append an identity matrix I after the second strip, and we get result 
every 3n+m clock periods. 



5. CONCLUSION 


In this paper, we first survey existing algorithms for least-squares esti 
nations by systolic arrays. Basic comparisons are made based on computa 
tion and inter-cell connection complexities of elementary units. Finally, 
by choosing the square-root information filtering algorithm, we showed a 
simple way to solve the Kalman filtering as a least-squares problem that 
can be processed by systolic arrays. Systolic array for Cholesky decompo 
sition is also proposed for whitening processing. By manipulating the 
data properly, the Kalman filtering can be processed under fully pipelined 
manner. There is no special constraint on our system equations and stan 
dard time-varying coefficient matrices and non-stationary colored noises 
are assumed in our model. Host of the processing units we need for this 
square root information filter do not involve square-root computations. 

The only exception is the computations for the Cholesky decomposition. 
However, for pipelined operation between whitening processing and QR 
decomposition, the later certainly involved more computational work than 
the former. Since there is only n square-root computations required in 
each iteration as compared with the operations required for QR decomposi 
tion, Cholesky decomposition will not become the bottleneck for this algo 
rithm. For many real life problems where we can assume noises are sta 
tionary, then covariance matrices W and V are fixed during our operation. 
In this case, inversed square-root covariance matrices can be obtained by 
pre-processing and our Kalman filtering can be solved as a simple least- 
square problem. Since all operations can be performed by the designed 
systolic array processing, which have the input/output formats matched to 
each other, the entire hardware design can be viewed as a pipelined struc 
ture. The estimated vector can be obtained with the 0(n) in time while 
compared with the 0(n J ) for the original Kalman filter. Finally, since 
this is a square root matrix operation, good numerical property can also 
be obtained. 
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Figure 1: Systolic array for least-squares estimation. 





Figure 2: QR systolic array for Kalman filtering 














